Spiritual Thought

D&C 6:36: “Look unto me in every thought; doubt not, fear not.”

“As an apostle of the Lord Jesus Christ, I invoke these blessings upon you, that as you look to the Savior and trust in Him, you will be blessed with hope to overcome perplexity, with spiritual settledness to cut through commotion, with ears to hear and a heart to always remember the word of the Lord, and with the discernment to see things as they really are.”

David A. Bednar - BYU Speeches, April 16th, 2021

Grading Exercises

Remember that you grade your own submitted exercises using the rubric specified in each posted exercise solution, adding comments (using the comment feature for Word documents) where your solution differs from the posted solution.

Marketing Analytics Process

Motivating Example

Imagine your manager at Patagonia is happy with your ability to filter and organize data, but she wants you to find another way to share your findings with stakeholders in a presentation you’ll be giving tomorrow. What are some concise and effective ways that you could communicate data about Patagonia’s customer base?

Discrete Data

Remember that summarizing data is initially all about discovery, the heart of exploratory data analysis.

  • Computing statistics (i.e., numerical summaries).
  • Visualizing data (i.e., graphical summaries).

How we summarize depends on whether the data is discrete or continuous.

  • Discrete means “individually separate and distinct.”
  • Discrete data are also called qualitative or categorical.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Can you identify any discrete variables? What are their data types in R?

customer_data <- read_csv("customer_data.csv", show_col_types = FALSE)
glimpse(customer_data)
## Rows: 10,531
## Columns: 14
## $ customer_id    <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1…
## $ birth_year     <dbl> 1971, 1970, 1988, 1984, 1987, 1994, 1968, 1994, 1958, 1…
## $ gender         <chr> "Female", "Female", "Male", "Other", "Male", "Male", "M…
## $ income         <dbl> 73000, 31000, 35000, 64000, 58000, 164000, 39000, 69000…
## $ credit         <dbl> 742.0827, 749.3514, 542.2399, 573.9358, 644.2439, 553.6…
## $ married        <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No"…
## $ college_degree <chr> "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"…
## $ region         <chr> "South", "West", "South", "Midwest", "West", "Midwest",…
## $ state          <chr> "DC", "WA", "AR", "MN", "HI", "MN", "MN", "KY", "NM", "…
## $ review_id      <dbl> 933551, NA, NA, NA, 501318, 109125, NA, 1959683, NA, 19…
## $ star_rating    <dbl> 4, NA, NA, NA, 5, 2, NA, 5, NA, 5, NA, 5, NA, NA, NA, N…
## $ review_time    <chr> "06 11, 2015", NA, NA, NA, "03 25, 2008", "06 7, 2013",…
## $ review_title   <chr> "Four Stars", NA, NA, NA, "Great Product!!", "Not at al…
## $ review_text    <chr> "everything's fine", NA, NA, NA, "I looked all over the…

A Word About Data Types

Data types in R, as in other programming languages, are more than simple labels. They tell the computer how to interpret and manipulate the values we provide. Some common ones include:

| Data Type | Abbreviation | Description |
|---|---|---|
| logical | lgl | Boolean values: TRUE, FALSE, or NA |
| integer | int | Whole numbers (e.g. 1, 42) |
| double (numeric) | dbl | Decimal numbers (e.g. 3.14, 2.0) |
| character | chr | Text strings (e.g. "hello", "R") |
| factor | fct | Categorical data with fixed levels (e.g. "Low", "High") |
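
To see how these types behave in practice, here is a minimal sketch using a small made-up tibble. The object name `toy_types`, its column names, and its values are illustrative assumptions, not part of customer_data.

```r
library(tibble)

# A small, made-up tibble with one column of each common type.
toy_types <- tibble(
  is_member = c(TRUE, FALSE, NA),          # logical
  visits    = c(1L, 42L, 7L),              # integer (note the L suffix)
  spend     = c(3.14, 2.0, 10.5),          # double
  region    = c("West", "South", "West")   # character
)

# typeof() reports how R stores each column.
sapply(toy_types, typeof)

# Converting a character column to a factor gives it a fixed set of levels.
region_fct <- factor(toy_types$region)
levels(region_fct)
```

Printing `toy_types` shows the same abbreviations (lgl, int, dbl, chr) that glimpse() displays for customer_data.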

Summarize Discrete Data

An important statistic for a discrete variable is a count.

customer_data |> 
  count(region)
## # A tibble: 4 × 2
##   region        n
##   <chr>     <int>
## 1 Midwest    1101
## 2 Northeast  3224
## 3 South      1111
## 4 West       5095
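
It can help to see what count() does under the hood. The sketch below uses a made-up tibble (`toy_customers` is a hypothetical stand-in for customer_data) to show that count() gives the same result as grouping and then summarizing with n().

```r
library(dplyr)
library(tibble)

# Made-up data standing in for customer_data.
toy_customers <- tibble(
  region = c("West", "West", "South", "Midwest", "West", "South")
)

# count() is shorthand for grouping and then summarizing with n().
by_count <- toy_customers |>
  count(region)

by_summarize <- toy_customers |>
  group_by(region) |>
  summarize(n = n())

# sort = TRUE puts the most common category first.
toy_customers |>
  count(region, sort = TRUE)
```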

How would I get a count by both region and college_degree (i.e., a cross-tab)?

customer_data |> 
  count(region, college_degree)
## # A tibble: 8 × 3
##   region    college_degree     n
##   <chr>     <chr>          <int>
## 1 Midwest   No               229
## 2 Midwest   Yes              872
## 3 Northeast No               640
## 4 Northeast Yes             2584
## 5 South     No               891
## 6 South     Yes              220
## 7 West      No               989
## 8 West      Yes             4106
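
The output above is in "long" form, with one row per combination. If a traditional cross-tab layout is easier to read, one option is to pivot the counts wider with {tidyr}. This is a sketch on made-up data (`toy_customers` is hypothetical), not the approach used in the lecture.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up data standing in for customer_data.
toy_customers <- tibble(
  region         = c("West", "West", "South", "South", "West", "Midwest"),
  college_degree = c("Yes",  "No",   "No",    "No",    "Yes",  "Yes")
)

# Count each combination, then spread college_degree across the columns.
cross_tab <- toy_customers |>
  count(region, college_degree) |>
  pivot_wider(
    names_from = college_degree,
    values_from = n,
    values_fill = 0   # combinations that never occur get a count of 0
  )

cross_tab
```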

Your Turn!

Write code that first filters the data to only unmarried customers, then counts the customers in each region. (You will need code we learned in the last lecture to do this!)

Solution

customer_data |>
  filter(married == "No") |>
  count(region)
## # A tibble: 4 × 2
##   region        n
##   <chr>     <int>
## 1 Midwest     578
## 2 Northeast  1791
## 3 South       587
## 4 West       2783

Visualize Data

{ggplot2} provides a consistent grammar of graphics built with layers.

  1. Data – Data to visualize.
  2. Aesthetics – Or “aes,” mapping graphical elements to data.
  3. Geometry – Or “geom,” the kind of graph representing the data.
  4. Facets, Labels, Scales, etc.
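
The layers listed above can be added one at a time. The following sketch builds a plot step by step from a made-up table of counts (`toy_counts` and its values are assumptions for illustration).

```r
library(ggplot2)
library(tibble)

# Made-up counts standing in for a summary of customer_data.
toy_counts <- tibble(
  region = c("Midwest", "South", "West"),
  n      = c(2, 3, 5)
)

# 1. Data and 2. aesthetics: an empty canvas with axes mapped to columns.
p <- ggplot(toy_counts, aes(x = region, y = n))

# 3. Geometry: add bars whose heights come from n.
p <- p + geom_col()

# 4. Labels: add a title and axis names.
p <- p + labs(title = "Customers by Region", x = "Region", y = "Count")

p
```

Printing `p` after each step shows how every `+` adds something to the same plot object.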

Visualize Discrete Data

Let’s plot our first summary. Note that + adds layers to a plot, while |> passes the result of one function on to the next.

customer_data |> 
  count(region) |> 
  ggplot(aes(x = region, y = n)) +
  geom_col()

Visualize our second summary by adding the aesthetic fill = college_degree.

The position argument of geom_col() defaults to "stack". Try "fill" instead.

customer_data |> 
  count(region, college_degree) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col()

customer_data |> 
  count(region, college_degree) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill")
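
With position = "fill", each bar is rescaled to a height of 1, so the segments show the share of each category within a region. The sketch below computes those same proportions by hand on made-up counts (`toy_counts` is hypothetical), which is a useful check on what the plot displays.

```r
library(dplyr)
library(tibble)

# Made-up counts by region and college degree.
toy_counts <- tibble(
  region         = c("South", "South", "West", "West"),
  college_degree = c("No", "Yes", "No", "Yes"),
  n              = c(3, 1, 2, 8)
)

# Divide each count by its region's total to get within-region proportions.
toy_props <- toy_counts |>
  group_by(region) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

toy_props
```

Within each region the proportions sum to 1, which is why every bar in the filled plot has the same height.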

Facets

Facets allow us to visualize by another discrete variable. For example, is this relationship different depending on gender?

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender)

Labels and Scales

With position = "fill", the y-axis no longer shows a count. It shows a proportion, so let’s change the labels.

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender) +
  labs(
    title = "Proportion of Customers with College Degrees by Region and Gender",
    subtitle = "Based on 10,531 Customers in the CRM Database",
    x = "Region",
    y = "Proportion"
  )

What about the legend? And these colors?

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender) +
  labs(
    title = "Proportion of Customers with College Degrees by Region and Gender",
    subtitle = "Based on 10,531 Customers in the CRM Database",
    x = "Region",
    y = "Proportion"
  ) +
  scale_fill_manual(
    name = "College Degree",
    values = c("cornflowerblue", "navy")
  )
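
When the colors are supplied without names, as above, they are matched to the categories in order. A possible refinement, sketched here on made-up counts, is to name each color so it is tied to a specific category, and to format the y-axis as percentages with {scales} (installed alongside {ggplot2}). The objects `toy_counts` and `degree_colors` are assumptions for illustration.

```r
library(ggplot2)
library(tibble)

# Made-up counts by region and college degree.
toy_counts <- tibble(
  region         = c("South", "South", "West", "West"),
  college_degree = c("No", "Yes", "No", "Yes"),
  n              = c(3, 1, 2, 8)
)

# Naming the colors ties each one to a specific category.
degree_colors <- c("No" = "cornflowerblue", "Yes" = "navy")

p <- ggplot(toy_counts, aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +  # 0.5 is shown as 50%
  scale_fill_manual(name = "College Degree", values = degree_colors)

p
```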

Your Turn!

Look at the ggplot below (also found in your starter code). Figure out what summary it is displaying, and then add polished labels and colors to create a publication-ready graph.

customer_data |> 
  filter(region == "Northeast") |>
  count(state, married) |> 
  ggplot(aes(x = state, y = n, fill = married)) +
  geom_col(position = "fill")

Possible Solution

customer_data |> 
  filter(region == "Northeast") |>
  count(state, married) |> 
  ggplot(aes(x = state, y = n, fill = married)) +
  geom_col(position = "fill") +
  labs(
    title = "Proportion of Married Customers in Each Northeast State",
    subtitle = "Based on 3,224 Customers in the CRM Database",
    x = "State",
    y = "Proportion"
  ) +
  scale_fill_manual(
    name = "Marital Status",
    values = c("#B2AC88", "#4b5320")
  )

Text Data

Text data is also discrete, but it is unstructured.

  • Authors can express themselves freely.
  • The same idea can be expressed in many ways.

What sort of structure might we impose on text data so we can visualize it?

Tokenize Text Data

We can use unnest_tokens() to tokenize the text (i.e., split it into individual words or tokens).

library(tidytext)

review_data <- customer_data |>
  select(customer_id, review_text) |> 
  unnest_tokens(word, review_text)

review_data
## # A tibble: 165,510 × 2
##    customer_id word        
##          <dbl> <chr>       
##  1        1001 everything's
##  2        1001 fine        
##  3        1002 <NA>        
##  4        1003 <NA>        
##  5        1004 <NA>        
##  6        1005 i           
##  7        1005 looked      
##  8        1005 all         
##  9        1005 over        
## 10        1005 the         
## # ℹ 165,500 more rows
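
To see what tokenizing does to a single review, here is a small sketch on made-up reviews (`toy_reviews` is hypothetical). It assumes the default settings of unnest_tokens(), which lowercase the text and strip punctuation.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up reviews, including one customer with no review.
toy_reviews <- tibble(
  customer_id = c(1, 2, 3),
  review_text = c("Great Product!!", "everything's fine", NA)
)

# One row per word, with the customer_id repeated for each word.
toy_tokens <- toy_reviews |>
  unnest_tokens(word, review_text)

toy_tokens
```

Each review becomes as many rows as it has words, which is why the tokenized data has far more rows than customer_data.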

Summarize Text Data

With the text data tokenized, we can compute counts just like other discrete data.

review_data |> 
  count(word) |> 
  arrange(desc(n))
## # A tibble: 10,178 × 2
##    word      n
##    <chr> <int>
##  1 the    7512
##  2 <NA>   7373
##  3 and    4633
##  4 i      4486
##  5 a      4176
##  6 to     3949
##  7 it     3581
##  8 for    2531
##  9 is     2419
## 10 of     2106
## # ℹ 10,168 more rows

Drop Missing Data

Missing values are (and should be) encoded as NA.

review_data <- review_data |> 
  drop_na(word)

review_data
## # A tibble: 158,137 × 2
##    customer_id word        
##          <dbl> <chr>       
##  1        1001 everything's
##  2        1001 fine        
##  3        1005 i           
##  4        1005 looked      
##  5        1005 all         
##  6        1005 over        
##  7        1005 the         
##  8        1005 internet    
##  9        1005 to          
## 10        1005 find        
## # ℹ 158,127 more rows

Remove Stop Words

Commonly used words aren’t very informative and are referred to as stop words.

stop_words
## # A tibble: 1,149 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ℹ 1,139 more rows

This is just a data frame, and we know how to join data frames!

An anti join returns the rows of the “left” data frame that have no match in the “right” data frame, keeping only the columns from the left. (Think of it as the opposite of an inner join.)

review_data <- review_data |>
  anti_join(stop_words, join_by(word))

review_data |> 
  count(word) |> 
  arrange(desc(n))
## # A tibble: 9,552 × 2
##    word        n
##    <chr>   <int>
##  1 fit       403
##  2 product   378
##  3 easy      338
##  4 quality   330
##  5 nice      324
##  6 bag       287
##  7 price     281
##  8 time      278
##  9 love      264
## 10 size      247
## # ℹ 9,542 more rows
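
A small example makes the anti join concrete. The sketch below uses made-up words and a made-up stop word list (`toy_words` and `toy_stop_words` are hypothetical), and contrasts anti_join() with semi_join(), which keeps the matching rows instead.

```r
library(dplyr)
library(tibble)

# Made-up tokenized reviews.
toy_words <- tibble(
  customer_id = c(1, 1, 1, 2, 2),
  word        = c("the", "fit", "is", "great", "the")
)

# A made-up stop word list.
toy_stop_words <- tibble(word = c("the", "is", "a"))

# anti_join() keeps rows whose word is NOT in the stop word list.
kept <- toy_words |>
  anti_join(toy_stop_words, join_by(word))

# semi_join() keeps rows whose word IS in the stop word list.
dropped <- toy_words |>
  semi_join(toy_stop_words, join_by(word))

kept
```

Together, `kept` and `dropped` account for every row of `toy_words`, and neither gains any columns from the stop word list.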

Visualize Word Counts

review_data |> 
  count(word) |> 
  arrange(desc(n)) |> 
  ggplot(aes(x = word, y = n)) +
  geom_col()

What can we do to make this plot readable?

Factors

Unlike a character variable, a factor can include information about order.

  • A factor’s levels are the distinct categories the variable can take, kept in a defined order.
  • Each value is stored as an integer code that points to its level; the labels are the character strings displayed for each level.

review_data |> 
  count(word) |> 
  arrange(desc(n)) |> 
  slice(1:10) |> 
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(x = n, y = word)) +
  geom_col()
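
To see why fct_reorder() fixes the ordering, here is a minimal sketch on three made-up word counts (the names `toy_words` and `toy_n` are assumptions for illustration). By default a factor orders its levels alphabetically; fct_reorder() reorders them by another variable.

```r
library(forcats)

# Made-up words and their counts.
toy_words <- c("fit", "product", "easy")
toy_n     <- c(403, 378, 338)

# A plain factor orders its levels alphabetically.
word_fct <- factor(toy_words)
levels(word_fct)        # "easy" "fit" "product"
as.integer(word_fct)    # the integer codes behind each value: 2 3 1

# fct_reorder() reorders the levels by another variable, here the counts.
word_by_n <- fct_reorder(word_fct, toy_n)
levels(word_by_n)       # "easy" "product" "fit"
```

Because ggplot2 draws the first level at the bottom of a discrete y-axis, ordering the levels from smallest to largest count puts the most frequent word at the top of the bar chart.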

Live Coding

Going back to our prompt from the beginning, imagine your manager specifically wants you to present on the customers in the South region: what these customers are like and what matters to them. How could we apply what we’ve learned today to accomplish this?

Wrapping Up

Summary

  • Computed counts, including tokenizing and counting text.
  • Practiced the basics of plotting with {ggplot2}.

Next Time

  • Summarizing continuous data with {dplyr}.
  • Visualizing continuous data with {ggplot2}.

Supplementary Material

  • R for Data Science (2e) Chapters 2 and 18

Artwork by @allison_horst

Exercise 3

In RStudio, create a new Quarto document and do the following.

  1. Load the tidyverse.
  2. Import and explore customer_data using the functions we’ve covered.
  3. Provide at least one interesting numeric summary and one interesting visualization using discrete variables only.
  4. Practice good coding conventions: Comment often, write in consecutive lines of code using the |>, and use the demonstrated style (e.g., variable names, spacing within functions).
  5. Export the R script and upload to Canvas.